Introduction to Bagging and Random Forests

Zak Varty

Decision Tree Recap

  • Partition predictor space and predict modal/mean outcome in each subregion.
  • Sequential, conditional “cuts” perpendicular to predictor axes
  • Continue splitting until stopping criteria met
    • e.g. change in residual deviance is small.
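The recap above can be sketched with `rpart` and the `palmerpenguins` data, both of which appear in the build information below. The predictor choice and `cp` value are illustrative, not prescriptive.

```r
# Fit a single classification tree: sequential axis-aligned cuts in
# predictor space, splitting until the improvement falls below cp.
library(rpart)
library(palmerpenguins)

penguins_cc <- na.omit(penguins)  # complete cases only

tree_fit <- rpart(
  species ~ bill_length_mm + flipper_length_mm,
  data = penguins_cc,
  method = "class",                    # predict the modal class per region
  control = rpart.control(cp = 0.01)   # stopping criterion on fit improvement
)

print(tree_fit)  # each row describes one conditional cut
```

Each split is perpendicular to one predictor axis, so the printed rules read as nested inequalities on `bill_length_mm` and `flipper_length_mm`.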

Decision Tree Recap

  • Easily interpretable model
  • Bias–variance trade-off
    • Decision boundaries bisect the gaps between (continuous) data points, so they are very sensitive to individual data values.
    • A fully grown tree is a low-bias, high-variance model, but we want one that generalises well.

Solution 1: Bagging

If \(Y_1, \ldots, Y_n\) are independent random variables, each with variance \(\sigma^2\), then

\[ \text{Var}(\bar Y) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{\sigma^2}{n}.\] This suggests that we could reduce the sensitivity of our tree-based predictions to the particular values in our dataset by:

  • Taking many training sets
  • Fitting a decision tree to each
  • Averaging the predictions made by each tree.

This is our first example of an ensemble model; ensembles are common in weather and hazard modelling.
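The variance formula above can be checked by simulation in base R. The variance \(\sigma^2 = 4\) and sample sizes here are arbitrary choices for illustration.

```r
# Averaging n independent estimates shrinks the variance by a factor of n:
# Var(mean of 25 draws) should be roughly sigma^2 / 25.
set.seed(1)
sigma2 <- 4
n_reps <- 10000

var_single <- var(rnorm(n_reps, mean = 0, sd = sqrt(sigma2)))
var_mean25 <- var(replicate(n_reps, mean(rnorm(25, 0, sqrt(sigma2)))))

round(c(var_single, var_mean25), 2)  # roughly 4 and 4/25 = 0.16
```

This is exactly the logic behind averaging trees: if the per-tree predictions behaved like independent draws, the ensemble variance would fall like \(\sigma^2/B\).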

Bagging

Issue: we only have one dataset.

Bagging is a portmanteau of Bootstrap Aggregation

  1. Re-sample original data \(x\) with replacement, \(B\) times: \(\{x^1, x^2, \ldots, x^B\}\).
  2. Fit tree \(\hat f_b(x)\) using \(x^b\) for \(b\) in \(1,\ldots,B\).
  3. Take mean (modal) prediction from all regression (classification) trees

\[ \hat f(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f_b(x).\]

Claim: This has a stabilising effect, reducing sensitivity to the particular data values we observe.
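The three steps above can be sketched by hand with `rpart` on the `palmerpenguins` data; `B = 25` and the predictors are illustrative choices.

```r
# Bagging sketch: resample with replacement, fit one tree per bootstrap
# sample, then aggregate by taking the point-wise modal class.
library(rpart)
library(palmerpenguins)

set.seed(42)
dat <- na.omit(penguins)
B <- 25

# Steps 1 & 2: B bootstrap resamples, one classification tree each.
trees <- lapply(seq_len(B), function(b) {
  boot_rows <- sample(nrow(dat), replace = TRUE)
  rpart(species ~ bill_length_mm + flipper_length_mm,
        data = dat[boot_rows, ], method = "class")
})

# Step 3: modal class over all B trees, point by point.
votes <- sapply(trees, function(tr) as.character(predict(tr, dat, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == dat$species)  # training accuracy of the ensemble
```

For a regression tree, step 3 would replace the majority vote with `rowMeans()` over the per-tree predictions.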

Bagging - Example (Bootstrap 1)


Fit a classification tree \(\hat f_1(x)\) to the first bootstrapped data set



Bagging - Example (Bootstrap 2)


Fit a classification tree \(\hat f_2(x)\) to the second bootstrapped data set



Bagging - Example (Repeat)

Bagging - Example (Aggregate)

Take the point-wise modal class over all trees as the prediction.

Bagging - Summary


  • ✅ Reduced variance of final model, as compared to a single tree.
  • ✅ Built-in holdout sets for cross validation using “out of bag” points.


  • ❌ The non-parametric bootstrap means each resample contains only values already present in our original data.
  • ❌ Aggregating trees destroys the interpretability of the model.
  • ❌ Each of the trees we are aggregating is likely to be similar. (*)
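The "out of bag" idea in the summary above can be seen directly: a bootstrap resample of \(n\) rows omits roughly a fraction \((1 - 1/n)^n \approx 1/e \approx 0.37\) of them, which acts as a free holdout set for that tree. The value `n = 100` below is arbitrary.

```r
# Each bootstrap resample misses ~37% of rows; those out-of-bag rows
# form a built-in holdout set for the corresponding tree.
set.seed(7)
n <- 100
boot_rows <- sample(n, replace = TRUE)
oob_rows <- setdiff(seq_len(n), boot_rows)

length(unique(boot_rows)) / n  # fraction of rows seen, about 0.63
length(oob_rows) / n           # out-of-bag fraction, about 0.37
```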

Random Forests

Suppose we have one very strong predictor and several moderately strong predictors.


  • The first cut will be in the strong predictor for almost all bagged trees.
  • This leads to high “correlation” between trees (technically, between the predictions of those trees).
  • It reduces the effective number of trees, so the variance reduction is smaller than we might initially expect.

Random Forests

Exactly the same as bagging, but using only a random subset of \(m \ll p\) predictors to determine each split.

Seems like throwing away information, but can help us to reduce the dependence between the trees that we are aggregating.


Generalising what we had before:

\[ \text{Var}(\bar Y) = \frac{\sigma^2}{n} + \underset{\text{minimise this}}{\underbrace{\frac{2}{n^2} \sum_{i<j} \text{Cov}(Y_i, Y_j)}}.\]
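This decomposition can be verified directly from a covariance matrix in base R; the values of \(n\), \(\sigma^2\) and the common pairwise covariance below are arbitrary illustrative choices.

```r
# Check: Var(mean(Y)) = sum(Sigma)/n^2 equals sigma^2/n plus twice the
# sum of the above-diagonal covariances divided by n^2.
n <- 5
sigma2 <- 2
rho <- 0.6  # hypothetical common covariance between pairs

Sigma <- matrix(rho, n, n)
diag(Sigma) <- sigma2

lhs <- sum(Sigma) / n^2  # variance of the mean, computed directly
rhs <- sigma2 / n + (2 / n^2) * sum(Sigma[upper.tri(Sigma)])

c(lhs, rhs)  # both equal 0.88 for these values
```

Setting `rho <- 0` recovers the independent case \(\sigma^2/n\), which is why decorrelating the trees pushes the ensemble variance back towards that ideal.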


Bioinformatics & multicollinearity: multiple genes expressed together, random subsetting allows us to “spread” the influence on the outcome across these genes.
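A random forest with per-split predictor subsetting can be fitted with the `randomForest` package. Note this package is an assumption here: it does not appear in the build information below, and `ntree` and `mtry` are illustrative values.

```r
# Random forest sketch: like bagging, but only mtry randomly chosen
# predictors are candidates at each split, decorrelating the trees.
library(randomForest)  # assumed dependency, not in the build info
library(palmerpenguins)

set.seed(42)
dat <- na.omit(penguins)

rf_fit <- randomForest(
  species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
  data = dat,
  ntree = 200,
  mtry = 2   # m = 2 of p = 4 predictors tried at each split
)

rf_fit$confusion  # out-of-bag confusion matrix comes for free
```

Setting `mtry` equal to the number of predictors recovers plain bagging, so the two methods can be compared within the same function.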

Random Forest Summary

Random forests are an ensemble model, averaging the prediction of multiple regression or classification trees.

  • Bootstrap aggregation reduces variance of the resulting predictions by averaging over multiple data sets that we might have seen.

  • Variance is further reduced by considering a random subset of predictors at each split, in order to decorrelate the trees we are aggregating.

Lab: Comparing single tree, bagging and random forest for student grade prediction.

Introducing Augustus et al

Build Information

R version 4.3.3 (2024-02-29)

Platform: x86_64-apple-darwin20 (64-bit)

locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: DescTools(v.0.99.58), rpart.plot(v.3.1.2), rpart(v.4.1.23), glue(v.1.8.0), ggplot2(v.3.5.1), dplyr(v.1.1.4) and palmerpenguins(v.0.1.1)

loaded via a namespace (and not attached): gld(v.2.6.6), gtable(v.0.3.5), xfun(v.0.43), lattice(v.0.22-5), vctrs(v.0.6.5), tools(v.4.3.3), generics(v.0.1.3), tibble(v.3.2.1), proxy(v.0.4-27), fansi(v.1.0.6), pkgconfig(v.2.0.3), Matrix(v.1.6-1.1), data.table(v.1.15.4), readxl(v.1.4.3), assertthat(v.0.2.1), lifecycle(v.1.0.4), rootSolve(v.1.8.2.4), compiler(v.4.3.3), farver(v.2.1.2), stringr(v.1.5.1), Exact(v.3.3), munsell(v.0.5.1), htmltools(v.0.5.8.1), class(v.7.3-22), yaml(v.2.3.8), pillar(v.1.9.0), crayon(v.1.5.3), MASS(v.7.3-60.0.1), boot(v.1.3-29), tidyselect(v.1.2.1), digest(v.0.6.35), mvtnorm(v.1.2-5), stringi(v.1.8.4), pander(v.0.6.5), purrr(v.1.0.2), showtextdb(v.3.0), labeling(v.0.4.3), forcats(v.1.0.0), zvplot(v.0.0.0.9000), fastmap(v.1.1.1), grid(v.4.3.3), colorspace(v.2.1-1), lmom(v.3.2), expm(v.1.0-0), cli(v.3.6.3), magrittr(v.2.0.3), emo(v.0.0.0.9000), utf8(v.1.2.4), e1071(v.1.7-14), withr(v.3.0.1), scales(v.1.3.0), showtext(v.0.9-7), lubridate(v.1.9.3), timechange(v.0.3.0), rmarkdown(v.2.26), httr(v.1.4.7), sysfonts(v.0.8.9), cellranger(v.1.1.0), hms(v.1.1.3), evaluate(v.0.23), knitr(v.1.45), haven(v.2.5.4), rlang(v.1.1.4), Rcpp(v.1.0.12), rstudioapi(v.0.16.0), jsonlite(v.1.8.8) and R6(v.2.5.1)